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Abstract 

While Gaussian probability densities are om- 
nipresent in applied mathematics, Gaussian 
cumulative probabilities are hard to calcu- 
late in any but the univariate case. We of- 
fer here an empirical study of the utility of 
Expectation Propagation (EP) as an approx- 
imate integration method for this problem. 
For rectangular integration regions, the ap- 
proximation is highly accurate. We also ex- 
tend the derivations to the more general case 
of polyhedral integration regions. However, 
we find that in this polyhedral case, EP's an- 
swer, though often accurate, can be almost 
arbitrarily wrong. These unexpected results 
elucidate an interesting and non-obvious fea- 
ture of EP not yet studied in detail, both for 
the problem of Gaussian probabilities and for 
EP more generally. 



1 Introduction 

This paper studies approximations to definite integrals 
of Gaussian distributions po( x ) = jV(x; m, K). De- 
spite the battery of convenient analytic characteristics 
of Gaussian densities, Gaussian probabilities are dif- 
ficult to calculate: the cumulative distribution func- 
tion (cdf) has no closed- form expression and is numer- 
ically challenging in high-dimensional spaces. Here we 
consider a generalisation of the cdf: the probability 
F{A) = Prob{x e .4} that a draw from po(x) falls 
in a (possibly unbounded) polyhedral region A. Ap- 
plications of these probabilities are widespread, e.g. 
Bock and Gibbons (1996); Lesaffre and Molenberghs 
(1991); Thiebaut and Jacqmin-Gadda (2004); Liao 
et al. (2007); Gassman ct al. (2002). Three machine 
learning problems that can be cast as Gaussian prob- 
abilities include the Bayes Point Machine (Hcrbrich, 



2002), Gaussian Process classification (Rasmussen and 
Williams, 2006; Kuss and Rasmussen, 2005), and pro- 
bit regression (Ochi and Prentice, 1984). 

Univariate Gaussian probabilities can be calculated 
with machine-level precision, but no similarly powerful 
algorithm exists for the multivariate case. There are 
special cases where analytic decompositions (Plack- 
ctt, 1954; Hugucnin et al., 2009) or sampling meth- 
ods (Lerman and Manski, 1981; Pakes and Pollard, 
1989) are possible though very costly, but the most 
efficient general, known method is numerical integra- 
tion (the recent book Genz and Bretz (2009) gives a 
good overview) . The Genz methods represent the state 
of the art, but achieving high accuracy invokes sub- 
stantial computational cost. Also, applications such 
as Bayesian model selection require analytic approxi- 
mations of F (A), usually because the goal is to opti- 
mise F{A) with respect to the parameters {m, K} of 
the Gaussian, which requires the corresponding deriva- 
tives. These derivatives and other features are not 
currently offered by numerical integration methods. 
Thus, there remains significant work to be done to 
address this important problem. 

Here we develop an analytic approximation to F(A) 
using Expectation Propagation (EP) (Minka, 2001a,b, 
2005; Opper and Winther, 2001, 2000) as an approx- 
imate integration method. In its most general form, 
this algorithm is applicable to polyhedral integration 
regions, which include the most commonly required 
hyperrectangular case (which, in turn, generalises the 
cdf). For hyperrectangular A, the approximations are 
of high quality. For polyhedral A, they are sometimes 
considerably worse. This shortcoming of EP will come 
as a surprise to many readers, because the differences 
to the rectangular case are inconspiciously straightfor- 
ward. We study this interesting result, which has im- 
plications for some important applications where EP is 
often used. In particular, the geometric interpretation 
of the Gaussian probability problem provides an in- 
sightful testbed for analysing properties of EP, which 
has importance for approximate inference well beyond 
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Gaussian probabilities. 



2 EP for Gaussian Probabilities 

A prototypical aspect of Bayesian inference is that an 
intractable p(x) can be written as a product of a prior 
Po( x ) and likelihood factors £j(x) such that p(x) = 
Po ( x ) II i U ( x ) • Our Gaussian probability problem is 
the normalisation constant of such a distribution: 

F(A) = J p (x)dx = y*p(x)dx= /"p (x)fji i (x)dx 

(1) 

where m can be any number of terms, greater than 
or less than or equal to the dimensionality n of po (x) . 
The factor ii(x) is an indicator function defined in a 
particular direction Cj, namely a "box function": 



fi(x) = l{k < cfx < Ui} 



k < cf x < 
otherwise. 



(2) 

Note that either or Uj can be infinite or inactive (not 
supporting the polyhedron), and thus this polyhedral 
definition is general. Using box functions instead of 
step- functions for this definition may be non-standard, 
but this choice will be useful in our analyses, and it 
reduces to hyperrectangular A by setting m = n and 
Cj to the cardinal axes. 

EP approximates the intractable p(x) by itera- 
tively replacing true factors £i(x) with approximate 
factors tj(x) from an exponential family, yielding 
the tractable approximate unnormalised distribution 
g(x) = po( x ) Yii ^i( x )- The EP objective function is in- 
volved, but EP is motivated by the idea of minimising 
the Kullback-Leibler divergence Dkl(p\\q)- F° r ?( x ) 
in the exponential family, this is equivalent to global 
moment matching (including the zcroth moment) be- 
tween q(x) and p(x). 

The details of EP are by now standard in our commu- 
nity, so we give only a short description in the interest 
of brevity. At any point in the iteration, EP tracks 
a current approximate g(x) in an exponential family, 
and a set of approximate factors (or messages) ti, also 
in the family. The factors are updated by constructing 
a cavity distribution 



«\*(x) 



<z( x ) 

?i(x)' 



(3) 



and then projecting into the exponential family 

? i (x)g\ i (x)= P roj[i i (x)g\ i (x)], (4) 

where this projection operation (an M-projection from 
information geometry (Koller and Friedman, 2009)) is 
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Figure 1 : The effect of whitening the Gaussian probability 
based on the mean and covariance of the Gaussian. 



defined as setting ij(x) to the unnormalised member 
of the exponential family minimizing I?KL(£i9^ I ||ii9^ 1 ) 
(Minka, 2004). This projection entails matching 
the sufficient statistics of ti(x)q\ i (x) to those of 
ii(x)g^(x), which for Gaussian fj(x) is equivalent to 
matching zeroth, first and second moments. 

Our use of EP is nonstandard in two ways. First, our 
factors tj(x) are unnormalised and thus so is p(x) and 
<?(x) - here we use EP not to construct parametric 
marginal distributions, but to calculate log partition 
functions. Second, we note that the moment matching 
projection in Equation 4 is tractable for any rank-one 
factor £,(x), not simply the usual axis-aligned likeli- 
hood terms where Cj are the cardinal axes. Simple 
rank-one updates make this algorithm computation- 
ally efficient. Critically, this generality allows us to 
consider probabilities F(A) both in axis-aligned hy- 
perrectangular cases and arbitrary polyhedra of any 
orientation and any number m of constraints/faces, 
more or less than the dimensionality n of po (x) . 

It will be useful to consider whitening the space of the 
problem via a substitution y = L _1 (x — m) where L is 
a Cholesky factor of the covariance K. In this trans- 
formed space, we have a standard JV(0, 1) Gaussian, 
and we have a transformed set of polyhedral faces L T Ci 
as in Figure 1. To understand sources of error, we can 
then attend to properties of this transformed region. 
Intuition suggests that the more (less) hyperrectangu- 
lar this transformed region is, the better (worse) the 
performance is, as hyperrectangular cases (in a white 
space) decompose into a product of solvable univariate 
problems. We then see that hyperrectangular cases are 
in fact not a fundamental distinction other than hav- 
ing the same number of constraints m as dimensions 
n. Considering the whitened space and transformed 
integration region is conceptually useful and theoreti- 
cally sound, as EP with Gaussians is invariant to this 
linear transformation (Seeger, 2008). 

The resulting algorithm, EP for multivariate Gaussian 
probabilities (EPMGP), yields a nice geometric inter- 
pretation. We want to integrate a Gaussian over a 
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polyhedron defined by several box functions, but this 
operation is intractable. Instead, EP allows us to re- 
place each of those intractable truncations functions 
with soft Gaussian truncations ?j(x) for which multi- 
plication is simple. 

There are a number of attractive features of the EP 
approach which may motivate its use in many applied 
settings, including fast runtime, low computational 
overhead, and a natural ability to calculate deriva- 
tives with respect to the distribution's parameters, tail 
probabilities, and first and second moments of p(x) . 

3 Experiments and Results 

To evaluate EP for this application, we need to es- 
tablish a baseline for calculating these probabilities, 
and we need to choose test cases. Our baseline will 
be the Genz method, currently the gold standard for 
these calculations, which makes a series of transfor- 
mations of the region A and the Gaussian po( x ) to 
enable accurate numerical integration. The integrand 
of Equation 1 is transformed to the unit cube, and 
heuristic choices on integration order are made be- 
fore quasi-random integration or lattice-point rules are 
used (Genz and Bretz, 2009). For our comparisons to 
EPMGP, we used the Genz method with 5 x 10 5 lattice 
points. 

We note that Monte Carlo methods are not partic- 
ularly well suited to this problem: rejection sam- 
plers from po (x) , importance samplers over the polyhe- 
dron A ( "pinball" or "hit-and-run" MCMC algorithms 
(Herbrich, 2002)), and elliptical slice sampling (Mur- 
ray et al., 2010) do not perform favorably (the latter 
is best among samplers, but still well beneath EP and 
Genz). In light of the accuracy of the Genz algorithms, 
we will not consider samplers further here. 

We next describe choosing our problem cases. To de- 
fine a Gaussian p (x) , we draw eigenvalues from an ex- 
ponential distribution and eigenvectors uniformly from 
the unit hypersphcre to form the covariance K. This 
procedure produces a more interesting range of K - in 
particular a better spread of condition numbers - than 
using a Wishart or similar. We set the mean m = 
without loss of generality. 

In the hyperrectangular integration cases (where di- 
mension n equals number of constraints m), A can be 
defined by the upper and lower bounds u and 1, which 
we defined by taking a draw fromp (x) and adding and 
subtracting uniform random lengths that scale with 
dimensionality n. Having the region size scale with 
the dimensionality n helps to prevent the probabilities 
F{A) from becoming vanishingly small with increasing 
dimension (which is handled fine by EP but problem- 



atic for the Genz method). In the arbitrary polyhe- 
dral case, the same procedure is repeated for drawing 
u and 1 for all m constraints. We also choose axes of 
orientation, which are unit-norm vectors £ IR™ for 
each of the m factors, again uniformly from the unit 
hypersphere. 

It is also useful to consider special cases where we can 
analytically find the true probability F(A). Orthants 
generalise quadrants to high dimension, for example 
{x £ R n : Xi > Vi}. For zero- mean Gaussians in 
R 2 , calculating the probability F(A) when A is an 
orthant is a simple geometry problem, and there are 
also geometric solutions in IR 3 , IR 4 , and R 5 (Genz and 
Bretz, 2009; Sinn and Keller, 2010). 

3.1 EP results for hyperrectangular and 
polyhedral integration regions 

Figure 2A-C shows the empirical results for hyper- 
rectangular integration regions. For each dimension 
n = {2,3,4,5,10,20,50,100}, we chose 250 random 
Gaussians po (x) and random regions A, and we calcu- 
lated the relative error between the EPMGP and Genz 
estimates. Each of these errors is plotted against di- 
mensionality n as a point in Figure 2A, with added 
horizontal jitter to show the distribution of points. 
The blue line and error bars represent the median and 
{25%, 75%} quartiles. The black line below the EP 
results is the median error estimate given by the Genz 
method. Since this line is almost always a few orders 
of magnitude below the distribution of EP errors, it is 
safe to use the high accuracy Genz method as a proxy 
to ground truth. Figure 2A demonstrates that the EP 
method gives a reliable estimate of Gaussian probabil- 
ities with hyperrectangular integration regions, with 
relative errors typically on the order of 10~ 4 and with 
individual errors rarely in excess of 1%. 

Figure 2B plots the errors by the value of log Z itself, 
where we use the log Z calculated from Genz. On this 
panel we can also plot a 1% error for Z ss F(A), as the 
log scale can suggest misleadingly large or small errors. 
Though it is a common intuition that this method 
should have high error in tail probabilities, there is 
instead a weakly positive or nonexistent correlation 
of error with log Z. Panel C repeats the same errors 
but plotted by the condition number of the covariance 
K and shows pronounced trend, which is expected by 
considering Figure 1. 

Figure 2D-F has the same setup as 2A-C, except 
we test over polyhedral integration regions, which 
amounts to additionally drawing m = n random poly- 
hedral faces Ci . The data tell a similar story but with 
surprisingly higher errors. The black line in Panel D 
shows the median EP errors in the axis aligned case 
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Figure 2: Empirical results for EPMGP with hyp errect angular and polyhedral integration regions. See text. 



(i.e., the blue line in Figure 2A). The purpose of this 
line is to show that indeed for the same Gaussian cases, 
the error rate of polyhedral EPMGP is typically an or- 
der of magnitude or two higher than in the hyperrect- 
angular case. Genz errors are not shown. Again Figure 
2E shows a modest positive effect in logZ. The only 
other change is Figure 2F, where we use the concept of 
whitening the space to consider the transformed region 
integrated over a standard Gaussian as in Figure 1. We 
consider this region as a matrix C with columns equal 
to L T Ci. To test the transformed region's proximity 
to / (a decomposable problem with no EP error), we 
use the condition number of C . Conveniently, in the 
hyperrectangular case this metric is the square root of 
the condition number of the covariance K. This allows 
us to compare Figures 2C and 2F, since this square 
root is just a constant scaling on the log scale shown. 
Figure 2F confirms our intuition that less hyperrectan- 
gular transformed integration regions will have higher 



error 1 . 

The sensitivity of error to the number of polyhedral 
constraints m is also an important question. Figure 
2G-I gives these empirical results: Panel G plots errors 
by number of polyhedral face constraints m instead of 
by Gaussian dimension n. In this row of the figure, all 
cases use n — 10 dimensions, and we show polyhedral 
sizes of m — {2,4,8,10,12,16,32,64}. Larger num- 
bers of constraints/polyhedral faces do imply larger 
error, trending up by roughly an order of magnitude 
or two. The errors shown in Figure 2H seem largely 
invariant to the value logZ itself. However, Panel I 
still indicates some upward trend, as we would expect. 

Figure 2D-I, namely the arbitrary polyhedral cases, 
show error rates that may in many cases be unaccept- 



1 There are a number of other sensible metrics on the 
transformed region, and we have produced similar plots 
with Frobenius and l\ norms of the scaled Gram matrix 
±C' T C. The error trend is the same. 
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able. Though EP still performs well often, the relia- 
bility of the algorithm is not what it is in the hyper- 
rectangular case of Figure 2A-C. 

Next, we calculate true orthant probabilities analyti- 
cally: Figure 3 shows errors for both the EP method 
(colour) and the Genz method (grayscale). The two 
panels are the usual errors plotted as a function of 
condition number of covariance (Panel A) and the true 
probability logZ (Panel B). The four cases shown are 
orthant probabilities at n = {2, 3, 4, 5}. First, we note 
that there is a clear separation between the Genz errors 
and the EP errors, which helps to solidify the earlier 
claim that the Genz numerical answer can be used as 
a reasonable proxy to ground truth. 

Second, it is interesting to note that there is significant 
structure in the EP errors with orthant probabilities 
when plotted by log Z in Panel B. Each dimension has 
a "V" shape of errors. This can be readily explained by 
reviewing what an orthant probability is. For a zero- 
mean Gaussian, an orthant probability in IR 2 is simply 
the mass of one of the quadrants. If that quadrant 
has a mass of \ogZ = log 0.25, then the correlation 
must be white, and hence EP will produce a highly 
accurate answer (the problem decomposes). Moving 
away from white correlation, EP will produce error. 
This describes the shape of the red curve for n = 2, 
which indeed is minimised at log0.25 w —1.39. The 
same argument can be extended for why there should 
be a minimum in IR 3 at logZ = log 0.125, and so on. 

In summary, the empirical results of Figures 2 and 
3 indicate that EPMGP is a successful candidate al- 
gorithm for Gaussian probabilities. The error rate is 
non-zero but generally quite low, with median errors 
less 10 -4 and individual errors rarely in excess of 1% 
across a range of dimensions, which may be acceptable 
in many applied settings. 

3.2 Pathological cases 

Figure 4 shows pathological cases designed to illustrate 
the shortcomings of EP. In all cases, we want to inte- 
grate the 7V(0, /) Gaussian in n = 2 dimensions over 
the [—1,1] x [—1,1] box, which is simply the product 
of two axis-aligned univariate box functions (ti(a;i) = 
I{-1 < x 1 < 1} and t 2 (x 2 ) = I{-1 < x 2 < 1}). The 
y-axis shows the errors as usual, and the x-axis is a 
feature chosen to create large errors. 

First, when looking at a transformed integration re- 
gion, EP error may derive from approximate factors 
<i(x) being not orthogonal, and thus overcounting the 
mass when forming the EP approximation q(x). An 
extreme, contrived example of this is adding two re- 
peats of identical factors ^(x). Though the integra- 
tion region is unchanged, EP is not invariant to this 



change, as it still must make an approximate fac- 
tor for each true factor. Figure 4A shows this re- 
dundancy pathology. We use EP to solve the same 
probability with increasing numbers of redundant fac- 
tors, adding up to 1000 copies of the same integra- 
tion region. Mathematically, we go from using EP 
to solve J po(x)t(xi)t(x 2 )dx, to using EP to solve 
/ p (x)t(a;i) 1000 t(a;2) 1000 dx. Though the true F(A) is 
unchanged, EP becomes arbitrarily bad. EP in this re- 
dundancy case is underestimating the true probability, 
for reasons explained in the next section. 

Second, EP error may derive from the approximate 
factors accounting for mass inappropriately. When 
two £j(x) are not orthogonal, moment matching will 
include mass that is outside the integration region. 
Figure 4B shows an integration region that is still the 
[—1,1] x [—1,1] box, but that can be described by the 
intersection of two boxes of any size, as the cartoon 
shows. However, EP must consider the mass that ex- 
ists in each true box factor individually. Hence we 
expect and indeed see that EP will not be invariant to 
this change, and further that EP should overestimate 
the true probability when there is extra mass, which 
is the term we use to describe this pathology of EP. 
Note that the case where there is no extra mass - Panel 
B at left - is still a case where there are two redundant 
boxes, and hence we expect the underestimation error 
from Panel A (corresponding points circled in blue). 
Lastly, Figure 4C shows that preprocessing the con- 
straints (by removing redundant factors or tightening 
factor limits) can not fix these pathologies: increasing 
numbers of slightly rotated boxes minimally describe 
A, but the same magnitude of errors is seen. 

These redundancy and extra mass issues are of course 
two effects of a single underlying problem: EP enforces 
a Gaussian approximation to a hard box-function fac- 
tor, which is a highly non-Gaussian form (but better 
than a conventional halfspace step function). This ef- 
fect is particularly pronounced for sharp box factors, 
but not limited to them: factors with weaker or heav- 
ier tails, and non-symmetric factors, all have similar 
issues. Thus redundancy and extra mass generally go 
hand in hand: if there are likelihood terms that are 
not orthogonal with respect to the prior covariance K, 
they will each have to consider both mass already con- 
sidered by another factor (redundancy) and mass that 
lies outside the actual polyhedron (extra mass). 

4 Correcting EP Errors 

These empirical results suggest two ways to improve 
EP for multivariate Gaussian probabilities, and they 
shed further light on the underpinnings of EP's failure 
modes. 
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Figure 3: Empirical results for orthant probabilities. See text for description. 
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Figure 4: Empirical results from intentionally pathological cases. See text for description. 



4.1 a-divergence perspective 

The redundancy and extra mass problems can be 
viewed as changing the effective a-divergence that EP 
minimises. The a-divergence D a (p \\ q) is a general- 
isation of the KL-divergence, and its relationship to 
EP and other variational schemes has been studied 
(Minka, 2005). The Power-EP algorithm is an exten- 
sion of EP where the factors are raised to a specific 
power cti before moment matching (Minka, 2004) . One 
of the motivations for this algorithm is that raising 
the factor tj to a power can make the projection more 
tractable. That paper also notes that running EP on 
a model where each factor U is repeated rii times is 
equivalent to running Power-EP (with the factors un- 
repealed) with a.i = 1/rii. Importantly, the Power- 
EP algorithm can also be derived as the minimisation 
of local a^-divergences. This repetition of factors is 
precisely the "redundancy" issue that we created by 
repeating constraints rii times in Figure 4A. Our EP 
algorithm in the pure redundant case can thus be in- 
terpreted as running Power-EP with on = l/ri^. 

First, we note that the zeroth moment Z is always un- 
derestimated by a-divergence with a < 1. Also, Z is 
only correctly estimated for a ~ 1 (KL-divergence), 
and it is overestimated for inclusive a-divergences 
which have a > 1. This provides a theoretical ex- 



planation for the systematic underestimation of Z in 
the redundancy case of Figure 4A: EP in this case 
is the same as Power-EP with on = 1/rii < 1. As 
in regular EP, a subtle aspect here is that Power-EP 
minimises the a-divergence locally, whereas the over- 
estimation/underestimation of the zeroth moment Z 
holds for global a-divergence minimisation, and the 
relationship between the two is not yet fully under- 
stood. However, our construction in Figure 4A was 
a fully factorised case, so there is direct correspon- 
dence between global and local divergences, and as 
such we did systematically observe underestimation in 
these experiments. Our experiments indicate similar 
underestimation even when there is not direct global- 
local correspondence. 

Second, this relationship between EP and a-divergence 
gives us a notion of effective a-divergence a c g. In 
particular, running EP with the factor ti repeated mi 
times is the same as local a-divergence minimisation 
by Power-EP with a ff = l/m^. When the constraints 
are almost-repeated (they have a large dot product, 
but are not fully colinear, such as Figure 4C), we lose 
the exact correspondence between EP and Power-EP, 
but we could still imagine that EP corresponds to 
Power-EP with a suitable — < a c g < 1 in this case 
also (using a continuity argument). 
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Figure 5: a' corrections to the rotated box example. The 
line a' = 1 (no correction) is the same curve as Figure 4C. 



Figure 6: a' corrections to the extra mass example. The 
line a' = 1 (no correction) is the same curve as Figure 4B. 



With these facts, we can try to correct for this a e s by 
using Power-EP with a' = 1/ a e ff instead of standard 
EP. Thus we can use the above arguments directly 
in Figure 4A to remove the errors entirely. For box 
factors that are repeated m, times, we correct with 
a[ — l/a e ff — rrii. Empirical findings confirm this 
theoretical treatment, and the errors in Figure 4A are 
entirely removed by this Power-EP correction. In Fig- 
ure 5 we repeat the experiment of Figure 4C, but for a 
range of a' corrections. We use a single correction for 
all factors ij(x), though in general each factor may be 
corrected individually. The a' = 1 curve in Figure 5 is 
the same as Figure 4C (up to a sign change). We then 
plot the same curve for many different corrections, and 
those black curves indicate that indeed there is an op- 
timal a' correction that allows the EP estimate to be 
error free. In the inset panel we plot those optimal 
oi values, which shows, for a given number of rotated 
boxes ( x-axis, log scale), the corrective a' required (y 
axis, log scale) to remove EP error entirely from the 
result. Here the curve is not linear, nor are the optimal 
a' — m (though reasonably close, since slightly rotated 
boxes are nearly repeats). These results are not sur- 
prising, since there should be some error-free value in 
many cases. Nonetheless we see that even choosing 
a sensible a! correction can considerably improve the 
EP result in this redundancy example. 

Next, the extra mass issue can also be viewed in terms 
of cv-divergences, though the connection is not as rigor- 
ous as the redundancy issue. The extra mass problem 
is one of inclusivity: the Gaussian approximate factor 
is including mass that it should not (as that mass lies 
outside the true polyhedron), which is a property of 
inclusive a-divergences as mentioned previously, i.e., 
divergences with a > 1. These we would expect to 



overestimate the true probability, and indeed this is 
what we see in Figure 4B. We can also consider cor- 
recting for these a e ff with Power EP. However, the sit- 
uation is slightly more complicated here as the extra 
mass problem involves understanding how a c g changes 
with the decay of the Gaussian distribution. Results 
showing this are shown in Figure 6. This figure tells 
the same story as Figure 5: a sensible choice of correc- 
tion term will remove much of the EP error. 

Of course, the challenge is to come up with a theory or 
principled heuristics for choosing a[ in general polyhe- 
dral problems, as intuition tells us there should be no 
straightforward answer here. For almost-repeated con- 
straints and polyhedra in general, we have attempted 
to similarly improve EP by calculating a suitable a e ff 
term from problem parameters (the region and the 
Gaussian), and using Power-EP. Our early investiga- 
tion of this correction has given some promising re- 
sults, often reducing errors significantly. However, be- 
cause often a e ff < 1 and hence the correction term is 
greater than unity (unlike standard Power-EP), this 
EP algorithm encounters frequent convergence and 
stability issues, even with significant damping on the 
outbound messages (which does not alter the solution 
or a e g). This obstacle suggests two areas of future 
work: first, better characterisation of a e g from geo- 
metric quantities such as dot products of constraints. 
Second, addressing convergence issues via a provably 
convergent double-loop approach to EP (Opper and 
Winther, 2005; Seeger and Nickish, 2011). 

To summarise, both the redundancy and extra mass 
problems can be cast in the language of effective a- 
divergence. We know that minimising D a only returns 
moment matching when a = 1 (KL-divergence) , and 
thus any efforts to drive a c g closer to 1 should improve 
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all the results seen here. 

4.2 Perturbation perspective 

Another important direction of work in correcting EP 
is approximations to the remainder term R = ^° s z ^, p 
(Paquet et al., 2009; Oppcr et al., 2009; Cseke and 
Hcskcs, 2011). By making various expansions to this 
remainder term, we might hope to generally reduce 
EP errors. A particularly relevant expansion for the 
purposes of Gaussian probability calculations is from 
Paquet et al. (2009) and uses the error terms e^(x) = 
tPtt \ — 1, such that the correction term R becomes 

Ziti (x) 

a finite series of moments over products of increasing 
numbers of true factors and approximating factors. Of 
course this is intractable in general. However, in our 
problem setting these moments are again multivariate 
probabilities of increasing order, which allows us to 
consider a variety of interesting schemes. 

Our current direction of work is to use the Genz 
method in low-dimensions as a refinement to EP. This 
effectively allows us to capture all the benefits of the 
EP method, which scales well with dimension and 
log Z, while using numerical quadrature where it is 
fast and extremely accurate, namely in the low dimen- 
sional case. This creates a hybrid EP-Genz method, 
where we can trade between the two approaches on a 
spectrum determined by the depth of the expansion. 
An important caveat of these expansion methods is 
that they rely on an already-good approximation from 
EP. However, our results from Figure 2 indicate that 
this is likely acceptable in most all cases, as EP is al- 
ready highly accurate. At this time we do not have 
preliminary results to report on this approach. 

5 Discussion 

We have presented an EP algorithm constructing an- 
alytic approximations to polyhedral integrals on mul- 
tivariate Gaussian distributions. Such approximations 
highlight interesting connections between approximate 
inference and approximate integration, and these ap- 
proximations are of value for a number of applications 
in machine learning and throughout science and ap- 
plied statistics. We show that under a variety of cir- 
cumstances, relative EP errors rarely exceed 1%, and 
median relative errors are typically two to four or- 
ders of magnitude smaller. While we spent much of 
the results exploring the regions of poor performance 
and possible corrections, the general strength of the 
method should not be neglected. EPMGP also has 
some nice features such as analytical derivatives, fast 
runtime, and a natural ability to compute tail proba- 
bilities. On the other hand, existing numerical meth- 



ods, in particular the Genz method, typically demon- 
strate high numerical accuracy. 

Our results put a qualifier on the results of pre- 
vious work using EP for evidence estimation, such 
as the Bayes Point Machine (Herbrich, 2002; Minka, 
2001a,b), which was thoroughly investigated only for 
a relatively low dimensional case (m = 3, n = 2 in 
our notation) and for prediction accuracy, not ground 
truth. Our results caution that in such applications, 
EP should be used with care in problems where con- 
straints outnumber dimensions, particularly m >• n. 

Another important existing connection point is Gaus- 
sian Process classification (Rasmusscn and Williams, 
2006; Kuss and Rasmussen, 2005). When a probit like- 
lihood (namely a univariate Gaussian cdf) is used, the 
data likelihood becomes a Gaussian probability prob- 
lem. The same is true of probit regression (Ochi and 
Prentice, 1984), where changing the representation of 
the more standard prior and likelihood reveals a high- 
dimensional Gaussian probability. Though these prob- 
lems are usually not considered Gaussian probabilities, 
it is another point of broad importance for EP and for 
Gaussian probabilities. Furthermore, as EP is often 
used in a different way in these problems, exploring 
the similarities between EP for Gaussian probabilities 
and EP for problems that can be cast as Gaussian 
probabilities is an interesting subject for future inves- 
tigation. 

Here we have investigated properties for EP as applied 
to Gaussian integration, but understanding how much 
of these issues stems from EP itself and how much from 
the application is an important clarification. Gaus- 
sian probabilities are particularly instructive, as they 
encourage the geometric interpretation that we have 
maintained throughout this paper. However, the is- 
sues of redundancy and extra mass are not specific to 
strict box functions, and the interpretation in terms of 
a e ff is no less applicable in the case of smooth func- 
tions. 

Perhaps the most interesting aspect of this work is that 
the Gaussian probability problem, by its simple geo- 
metric interpretation, allows clear investigation into 
the strengths and weaknesses of EP, which bears rele- 
vance to applications well beyond just Gaussian prob- 
abilities. As EP becomes a more popular approximate 
inference method, this cautionary result suggests that 
EP should not be trusted blindly in new applications. 

Acknowledgements 

We thank Tom Minka, Thore Graepel, Ralf Herbrich, 
Carl Rasmussen, Zoubin Ghahramani, Ed Snelson, 
Ulrich Paquet, and David Knowles for helpful dis- 
cussions. JPC was supported by the UK Engineer- 



John P. Cunningham, Philipp Hennig, Simon Lacoste-Julien 



ing and Physical Sciences Research Council (EPSRC 
EP/H019472/1); PH and SLJ were supported by a 
grant from Microsoft Research Ltd. 

References 

Bock, R. D. and Gibbons, R. D. (1996). High- 
dimensional multivariate probit analysis. Biomet- 
rics, 52(4):1183-1194. 

Cseke, B. and Heskes, T. (2011). Approximate 
marginals in latent Gaussian models. Journal of 
Machine Learning Res., 12:417-454. 

Gassman, H. I., Deak, I., and Szantai, T. (2002). Com- 
puting multivariate normal probabilities: A new 
look. J Comp Graph Stats, ll(4):920-949. 

Genz, A. and Bretz, F. (2009). Computation of Mul- 
tivariate Normal and t probabilities. Springer. 

Herbrich, R. (2002). Learning Kernel Classifiers. MIT 
Press. 

Huguenin, J., Pclgrin, F., and Holly A. (2009). Esti- 
mation of multivariate probit models by exact maxi- 
mum likelihood. Technical report, University of Lau- 
sanne, Institute of Health and Economics Manage- 
ment (IEMS). 

Roller, D. and Friedman, N. (2009). Probabilistic 
Graphical Models: Principles and Techniques. MIT 
Press. 

Kuss, M. and Rasmussen, C. (2005). Assessing approx- 
imate inference for binary Gaussian process classifi- 
cation. Journal of Machine Learning Res., 6:1679- 
1704. 

Lerman, S. and Manski, C. (1981). On the use of 
simulated frequencies to approximate choice prob- 
abilities. Structural analysis of discrete data with 
econometric applications, 10:305-319. 

Lesaffre, E. and Molenberghs, G. (1991). Multivariate 
probit analysis: a neglected procedure in medical 
statistics. Statistics in Medicine, 10(9):1391-1403. 

Liao, X., Li, H., and Carin, L. (2007). Quadratically 
gated mixture of experts for incomplete data clas- 
sification. In Proceedings of the 24th International 
Conference on Machine Learning. 

Minka, T. P. (2001a). Expectation Propagation for 
approximate Bayesian inference. Uncertainty in AI, 
pages 362-369. 

Minka, T. P. (2001b). A family of algorithms for ap- 
proximate Bayesian inference. MIT Ph.D. Thesis. 

Minka, T. P. (2004). Power EP. Technical report, 
Microsoft Research. 

Minka, T. P. (2005). Divergence measures and message 
passing. Microsoft Research Tech. Report MSR- TR- 
2005-173. 



Murray, I., Adams, R. P., and MacKay, D. J. C. 
(2010). Elliptical slice sampling. Proc. 13th Intl 
Conference on AISTATS. 

Ochi, Y. and Prentice, R. L. (1984). Likelihood 
inference in a correlated probit regression model. 
Biometrika, 71(3):531-543. 

Opper, M., Paquct, U., and Winther, O. (2009). Im- 
proving on expectation propagation. In Advances in 
Neural Information Processing Systems. MIT Press. 

Opper, M. and Winther, O. (2000). Gaussian pro- 
cesses for classification: Mean field algorithms. Neu- 
ral Comp., 12:2655-2684. 

Opper, M. and Winther, O. (2001). From naive mean 
field theory to the TAP equations. In Opper, M. and 
Saad, D., editors, Advanced Mean Field Methods: 
Theory and Practice. MIT Press, Cambridge, MA. 

Opper, M. and Winther, O. (2005). Expectation con- 
sistent approximate inference. Journal of Machine 
Learning Res., 6:2177-2204. 

Pakes, A. and Pollard, D. (1989). Simulation and the 
asymptotics of optimization estimators. Economet- 
rica, 57(5):1027-1057. 

Paquct, U., Winther, O., and Opper, M. (2009). 
Perturbation corrections in approximate inference: 
Mixture modeling applications. Journal of Machine 
Learning Res., 10:1263-1304. 

Plackett, R. L. (1954). A reduction formula for normal 
multivariate integrals. Biometrika, 41(3-4):351-360. 

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaus- 
sian Processes for Machine Learning. MIT Press, 
Cambridge. 

Seeger, M. (2008). Expectation progapation for expo- 
nential families. Technical Report, U. C. Berkeley. 

Seeger, M. and Nickish, H. (2011). Fast convergent al- 
gorithms for Expectation Propagation approximate 
Bayesian inference. Proc. 14th Intl Conference on 
AISTATS. 

Sinn, M. and Keller, K. (2010). Covariances of zero 
crossings in Gaussian processes. Theory Prob and 
Appl. To appear. 

Thiebaut, R. and Jacqmin-Gadda, H. (2004). Mixed 
models for longitudinal left-censored repeated 
measures. Comput Methods Programs Biomed, 
74(3):255-260. 



Approximate Gaussian Integration using Expectation Propagation 



Appendix: details for algorithm and equations 

For completeness, we include here a more detailed description of the EPMGP algorithm: 



Algorithm 1 EPMGP: Calculate F(A), the probability of po(x) on the region A. 

1: Initialise g(x) ({Z, fj,, £}; typically q(x) = p (x)) and factors U(x) with zero precision. 

2: while q(x) has not converged do 

3: for i <- 1 : m do 

4: form cavity local q^ l (x) by Equation 5. 

5: calculate moments of q\ l (x)ti(x) by Equations 6-7. 

6: choose i»(x) so g\*(x)fi(x) matches above moments by Equation 8. 

7: end for 

8: update /x, S with new i,(x) by Equation 9. 

9: end while 

10: calculate Z, the total mass of q(x) using Equation 10. 

11: return Z, the approximation of F(A). 



Given a current EP approximation q(x) = p n (x)Y\ i t i (x) = ZJ\f(x; /x, S) and individual approximate factors 
ti(x) = ZiJ\f(c[x;jli,af), we have the following rule for calculating the EP cavity distribution of Equation 3, 
(which is actually only needed along the c, direction in our rank-one case: q\ l (cfx) = Z^Af (cfx; /j,^, ct^)) - it 
is a Gaussian with parameters: 

A*\< = ^d ^^((cfSc,)- 1 -^ 2 )- 1 . (5) 

Setting the Cj to the cardinal axes returns a hyperrectangular (axis-aligned) integration region and a more 
conventional EP algorithm with the familiar cavity parameter forms (e.g., Equation 3.56 in Rasmussen and 
Williams (2006)). 

For the projection step given in Equation 4, we first need to compute the zeroth, first and second moments 
{Zi, ftcj, of Cjcf } of q\ l (x)ti(x) which can be done exactly using the error function: 

Zi = Verf(/3)-erf(a)Y ft - Mv + i^(exp{-a 2 } - exp{-/? 2 }), (6) 
^ v 7 Zi \f 2tt 

of = fi^ +0-^ + 4- ((Zi + Mv )exp{-a 2 } - (u, + ^ v )cxp{-/3 2 }) - ft 2 , (7) 

where we have everywhere above used the shorthand a = ll ^ and B — "V . These moments are used to 
update U(x) in the usual way, using standard manipulation of Gaussian random variables: 

U(x) = Z,AA(cfx;ft,a 2 ), (8) 
where ft = 5- 2 (<rr 2 ft - ct^ 2 ^ v ), of = (<jr 2 - o^ 2 )" 1 , 

Z 4 = ZiV^y/a^ + a 2 cxp{i( Mv - ft) 2 /(4 + a 2 )}. 

Now, by the definition of the approximation, we can calculate the new approximation q(x) as the product 
q(x) =p (x) rii*»(x): 

g(x) = ZJV(/i,E), where /lx = E^^m + ^^c;) , S = (if" 1 + J^c.cf ) , (9) 

»=i CTi »=i CTl 

where m, if are the parameters of po(x). These forms can actually be efficiently and numerically stably calculated 
in quadratic computation time in our implementation using the natural parameterisation of the Gaussian rather 
than the mean parameters, given a Cholcsky decomposition of K. Lastly, once the algorithm has converged, 
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we can calculate the normalisation constant of q(x) which is our quantity of interest - the approximation of the 
probability F(A): 



In the above we have broken this equation up into three lines to clarify that the normalisation term logZ has 
contribution from the prior p (x) (first line) , the approximate factors (the second line) , and the full approximation 
g(x) (third line). An intuitive interpretation is that the integral of q(x) is equal to the product of all the 
contributing factor masses (the prior and the approximate factors ij), divided by what is already counted by the 
normalisation constant of the final approximation q(x). Similar reparameterisations of these calculations help to 
ensure numerical stability and quicker, more accurate computation. In this reparameterisation, the only cubic 
time operation is the computation of the log determinant and a cholesky factorization of K- we never have to 
invert or solve for £ explicitly during the algorithm. 




+ i( At T £-V + log|£|) 



(10) 



